YEASTRACT+: a portal for the exploitation of global transcription regulation and metabolic model data in yeast biotechnology and pathogenesis

Abstract YEASTRACT+ (http://yeastract-plus.org/) is a tool for the analysis, prediction and modelling of transcription regulatory data at the gene and genomic levels in yeasts. It incorporates three integrated databases: YEASTRACT (http://yeastract-plus.org/yeastract/), PathoYeastract (http://yeastract-plus.org/pathoyeastract/) and NCYeastract (http://yeastract-plus.org/ncyeastract/), focused on Saccharomyces cerevisiae, pathogenic yeasts of the Candida genus, and non-conventional yeasts of biotechnological relevance. In this release, YEASTRACT+ offers upgraded information on transcription regulation for the ten previously incorporated yeast species, while extending the database to another pathogenic yeast, Candida auris. Since the last release of YEASTRACT+ (January 2020), a fourth database has been integrated. CommunityYeastract (http://yeastract-plus.org/community/) offers a platform for the creation, use, and future update of YEASTRACT-like databases for any yeast of the users’ choice. CommunityYeastract currently provides information for two Saccharomyces boulardii strains, the oleaginous yeast Rhodotorula toruloides NP11, and Schizosaccharomyces pombe 972h-. In addition, the YEASTRACT+ portal currently gathers 304 547 documented regulatory associations between transcription factors (TF) and target genes and 480 DNA binding sites, considering 2771 TFs from 11 yeast species. A new set of tools, currently implemented for S. cerevisiae and C. albicans, is further offered, combining regulatory information with genome-scale metabolic models to provide predictions on the most promising transcription factors to be exploited in cell factory optimisation or to be used as novel drug targets. The expansion of these new tools to the remaining YEASTRACT+ species is ongoing.


INTRODUCTION
Yeasts are a diverse group of unicellular fungal species with a strong impact on human life. The best-known yeast is by far Saccharomyces cerevisiae, long used unknowingly for its alcoholic fermentation ability in the brewing and wine industries, but also in the production of bread and other dough-based products. Given its early biotechnological success, its genetic amenability and its genome, fully sequenced since 1996 (1), S. cerevisiae has been exploited as a cell factory for the industrial production of many added-value compounds (2). Recent years have seen a tremendous increase in the number and variety of yeast species displaying biotechnological potential thanks to their natural properties. Among them are the methylotrophic yeast Komagataella phaffii (formerly Pichia pastoris), a favourite host for recombinant protein production (3); the weak acid-resistant food spoilage yeast Zygosaccharomyces bailii (4); Kluyveromyces lactis, widely used in cheese production (5); the thermotolerant yeast Kluyveromyces marxianus (6) and the oleaginous yeast Yarrowia lipolytica (7).
D786 Nucleic Acids Research, 2023, Vol. 51, Database issue
On the other end of the spectrum lie pathogenic yeasts of the Candida genus, major causative agents of human systemic fungemia, responsible for more than 400,000 life-threatening infections worldwide every year (8). Candida albicans, Candida glabrata, Candida parapsilosis and Candida tropicalis are the most prevalent among candidiasis patients, accounting for >90% of all Candida infections (9). More recently, Candida auris has emerged as a pathogen of concern, being associated with the first cases of candidiasis outbreaks in hospital environments, and displaying unusual resistance to the currently available antifungal armamentarium (10).
A complete understanding of the molecular and regulatory mechanisms that control productivity in biotechnologically relevant yeasts is key to guiding the design of more effective cell factories. Likewise, understanding the molecular mechanisms that control pathogenesis-related phenotypes in human pathogens is essential to guide the design of more effective therapeutic options. One of the most promising Systems Biology-based methodologies to address both issues is the use of Genome-Scale Metabolic Models (GSMMs), which provide a simplified, yet comprehensive, view of the full metabolism of an organism, and enable the simulation of the system's behaviour. Indeed, metabolic engineering based on GSMMs has been successful in optimising the production of added-value compounds in yeasts (11). In parallel, GSMMs have also been exploited in the search for promising new drug targets, by facilitating the prediction of gene essentiality in pathogenic organisms (12). However, the lack of integration of regulatory information in the currently available GSMMs limits their predictive ability, and prevents the identification of transcription factors as promising targets for metabolic engineering or for the design of new antifungal drugs.
In this release, the most recent YEASTRACT+ upgrade is presented, including up-to-date curated information on all published regulatory associations between transcription factors (TFs) and target genes or TFs and their DNA binding sites. Besides the previously integrated YEASTRACT (13-18), PathoYeastract (19) and NCYeastract (20) databases, it also presents upgrades in three dimensions: (i) the introduction of a fourth database, CommunityYeastract; (ii) the integration of C. auris, another Candida species, in the PathoYeastract database and (iii) a set of new computational tools that combine regulatory data with genome-scale metabolic models, aimed at predicting the most promising TFs to be exploited in cell factory optimisation or to be used as novel drug targets.

DATA UPDATE AND UPGRADE
In this paper, the upgrade of the YEASTRACT+ portal is presented, including updates on the YEASTRACT, PathoYeastract and NCYeastract databases, as detailed in Table 1.
Figure 1. Top: set of options with the selected medium, rank criterion and metabolite highlighted in red. Bottom: table listing the genes whose expression manipulation is predicted to optimise the production of the selected metabolite, obtained by simulating the selected GSMM and the selected medium. Here, the predicted production is given in terms of the exchange reaction flux. The impact on the metabolite production of various gene expression manipulations is displayed, from full gene Knock Out (KO) or decreased expression (UE, Under-Expression, from 0 to 0.75-fold the wild-type levels), to increased expression (OE, Over-Expression, from 1.25- to 1.5-fold the wild-type levels). Cells shaded in salmon contain values lower than the WT exchange reaction flux or infeasible cases. The last two columns highlight the highest metabolite production and associated manipulation of gene expression, followed by the 'View' link for details on the changes imposed on reaction fluxes by said manipulation.
YEASTRACT, focused on S. cerevisiae, currently includes 215 398 regulatory associations between TFs and target genes, as well as 310 associations between TFs and TF binding sites, which corresponds to a 5% increase in the amount of available data since its latest release. Data on transcriptional regulatory associations in NCYeastract was also updated. Specifically, increases of 1%, 0%, 4.3%, 0.9% and 0.5% in the number of experimentally determined regulatory associations between TFs and target genes were registered in the last 2 years for Komagataella phaffii, Zygosaccharomyces bailii, Kluyveromyces lactis, Kluyveromyces marxianus and Yarrowia lipolytica, respectively. In the case of PathoYeastract, the number of regulatory associations between TFs and target genes deposited in the database increased by 2%, 114%, 0% and 0.4% for C. albicans, C. glabrata, C. parapsilosis and C. tropicalis, respectively. Additionally, a fifth species of pathogenic yeast, C. auris, was included in the database. Although relatively little is yet known about this emergent species, its predicted impact on the clinical development of recalcitrant candidiasis, associated with hospital outbreaks of the disease, led us to provide the community with this resource, which currently includes only seven experimentally characterised associations between TFs and target genes.
All TF-target gene and TF-TF binding site associations deposited in YEASTRACT+ are provided with specific information on the underlying publication and on the experimental setup used to identify each regulatory association. This includes the classification of the approach used as based either on DNA binding data (e.g. Chromatin ImmunoPrecipitation (ChIP), ChIP-on-chip, ChIP-seq and Electrophoretic Mobility Shift Assay (EMSA)) or on Expression data (e.g. RT-PCR, microarray hybridisation, RNA sequencing or expression proteomics), as well as information on the environmental conditions in which each association was found to take place.
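As an illustration of the annotation scheme just described, the following Python sketch models a single regulatory association and derives its evidence class from the experimental technique. The field names, the `EVIDENCE_CLASS` table and the example records are hypothetical and do not reflect the actual YEASTRACT+ schema; only the grouping of techniques into DNA binding and Expression evidence is taken from the text.

```python
from dataclasses import dataclass

# Grouping of experimental techniques, as listed in the text.
EVIDENCE_CLASS = {
    "ChIP": "DNA binding",
    "ChIP-on-chip": "DNA binding",
    "ChIP-seq": "DNA binding",
    "EMSA": "DNA binding",
    "RT-PCR": "Expression",
    "microarray": "Expression",
    "RNA-seq": "Expression",
    "expression proteomics": "Expression",
}


@dataclass(frozen=True)
class Association:
    """One TF-target association (hypothetical record layout)."""
    tf: str
    target: str
    technique: str
    condition: str
    reference: str

    @property
    def evidence(self) -> str:
        # Derive the evidence class from the technique used.
        return EVIDENCE_CLASS[self.technique]


def with_evidence(associations, evidence):
    """Keep only the associations supported by the given evidence class."""
    return [a for a in associations if a.evidence == evidence]


# Invented example records, for illustration only.
records = [
    Association("TF_A", "GENE_1", "ChIP-seq", "glucose, exponential phase", "ref-1"),
    Association("TF_A", "GENE_2", "microarray", "glucose, exponential phase", "ref-2"),
]
```

Filtering `records` with `with_evidence(records, "Expression")` would keep only the microarray-based association.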
The increasing exploitation of a variety of yeast species of biotechnological or medical interest constitutes a challenge, as many of them are poorly characterised, particularly in terms of their transcriptional networks. The lack of data in these organisms, especially when compared with the model yeast S. cerevisiae, can, at least partially, be compensated by the use of comparative genomics approaches. These permit the exploitation of the knowledge of well-known organisms to predict the function and regulation of orthologous proteins in poorly characterised or uncharacterised systems. Naturally, given that the conservation of gene and TF function, TF binding sites and regulatory associations among different species is not complete, results obtained through this comparative genomics approach should be regarded as merely indicative, requiring experimental validation. Still, with this in mind, the possibility of expanding YEASTRACT+ to an unlimited number of yeast species, for which no specific regulatory data is gathered, but whose genomic sequence can be used to predict gene and genome-wide regulatory pathways, led to the development of CommunityYeastract.
Figure 2. Here, the predicted production is given in terms of the exchange reaction flux. The impact on the metabolite production of TF Knock Out (KO) or Over-Expression (OE) is displayed, for different effects of the TF expression manipulation on its target genes (TGs). A TF Knock Out (KO) effect ranges from Under-Expression (UE) of its activated TGs (from 0 to 0.5-fold the wild-type levels) to OE of its repressed TGs (from 1.25- to 1.5-fold the wild-type levels). A TF OE effect ranges from OE of its activated TGs (from 1.25- to 1.5-fold the wild-type levels) to UE of its repressed TGs (from 0 to 0.5-fold the wild-type levels). The last two columns highlight the highest level of the metabolite production and associated manipulations of TGs expression, followed by the 'View' link for details on the changes imposed on reaction fluxes by said manipulations.
CommunityYeastract (Community Yeast Search for Transcriptional Regulators And Consensus Tracking) is a repository of automatically generated YEASTRACT-like databases, for yeast species or strains, according to the request of community members (20). No data on transcription associations documented for the specific organism is included. However, all YEASTRACT+ queries may be run on genes or datasets of the specific organism, considering regulatory information of homologous genes in related yeast species fully described in YEASTRACT, PathoYeastract and NCYeastract.
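The homology-based inference described above can be pictured as a two-step lookup: map a gene of the community organism to its homologues in the curated species, then collect the TFs documented as regulators of those homologues. The Python sketch below is only a schematic view of that idea. All identifiers and both tables are invented, and the code does not reproduce the CommunityYeastract implementation.

```python
# Hypothetical homology table: gene in the community organism ->
# list of (curated species, homologous gene) pairs.
homologs = {
    "NEW_0001": [("S. cerevisiae", "GENE_A")],
    "NEW_0002": [("S. cerevisiae", "GENE_B"), ("C. albicans", "GENE_B2")],
    "NEW_0003": [],  # no homologue found in any curated species
}

# Hypothetical documented regulators in the curated databases.
documented_regulators = {
    ("S. cerevisiae", "GENE_A"): {"TF_1", "TF_2"},
    ("S. cerevisiae", "GENE_B"): {"TF_2"},
    ("C. albicans", "GENE_B2"): {"TF_3"},
}


def predicted_regulators(gene):
    """Return {TF: set of species supporting the prediction} for a gene.

    TFs are named as in the curated species; the result is a prediction
    to be validated experimentally, not a documented association.
    """
    found = {}
    for species, homolog in homologs.get(gene, []):
        for tf in documented_regulators.get((species, homolog), set()):
            found.setdefault(tf, set()).add(species)
    return found
```

Keeping track of the supporting species makes it visible how well conserved a predicted association is across the curated databases.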
CommunityYeastract currently provides information for two probiotic Saccharomyces boulardii strains, Biocodex and Unique 28 (25), the oleaginous yeast Rhodotorula toruloides NP11 (26), and the model fission yeast Schizosaccharomyces pombe 972h-. Tools to automatically generate YEASTRACT-like databases, based on genome sequences, were provided elsewhere (26). However, the YEASTRACT team welcomes requests from its users or potential users to add additional yeast species to CommunityYeastract.

INTEGRATION OF GENOME-SCALE METABOLIC MODELS WITH REGULATORY INFORMATION: NEW TOOLS FOR STRAIN OPTIMISATION AND DRUG TARGET IDENTIFICATION
Genome-Scale Metabolic Models (GSMMs) aim to provide a reconstruction of the whole metabolism of an organism, through its description as a mathematical model. The first GSMM was built for Haemophilus influenzae, in 1999 (27), followed by Escherichia coli, in 2000 (28), and by S. cerevisiae, in 2003 (29). Throughout the last two decades, numerous GSMMs have been constructed, including some dedicated to multicellular organisms, such as humans (30). GSMMs contain three main levels of information: metabolites, reactions and metabolic genes.
Figure 3. Top: set of options with the selected medium, and whether essentiality is evaluated for genes, reactions or TFs, as highlighted in red rectangles. Bottom left: table of results of essential genes for RPMI medium. Predicted essential genes, as defined by COBRApy, are those whose single Knock Out (KO) is predicted to lead to biomass production flux below 1% of that of the wild-type strain. The biomass production flux predicted upon in silico deletion of the indicated gene/ORF is displayed. Although KO and UE=0.0 (Under-Expression) appear to be the same, the simulation tools handle them differently for reactions having multiple enzymes with the same function. In such a case, the gene 'KO' is simulated as having no impact on reaction flux (the other isoenzymes are supposed to fully replace the deleted one), while 'UE=0.0' is simulated by a decrease of the reaction flux, inversely proportional to the number of isoenzymes (e.g. if 3 isoenzymes catalyse a reaction, the deletion of one of the coding genes will lead to a 33% reduction of the reaction flux). Bottom right: predicted essential reactions for RPMI medium, as defined by COBRApy, that is, those whose blockage (reaction flow = 0) leads to biomass production flux below 1% of that of the wild-type strain. The predicted biomass production flux is displayed together with the reaction ID and name.
Figure 4. Depiction of the 'Essentiality' prediction query when looking for essential TFs in Candida albicans. Top left: set of options with the selected medium, and whether essentiality is evaluated for genes, reactions or TFs, as highlighted in red rectangles. Bottom: table of essential TFs for the RPMI medium. Predicted essential TFs, as defined by COBRApy, are those whose single Knock Out leads to biomass production flux below 1% of that of the wild-type strain. The predicted biomass production flux upon in silico Knocking Out (KO) or Under-Expression (UE) of each TF is displayed, considering impacts on the expression of its target genes (TGs) ranging from UE of activated TGs (from 0 to 0.5-fold the wild-type levels) to Over-Expression (OE) of repressed TGs (from 1.25- to 1.5-fold the wild-type levels). Top right: regulatory network of one of the identified essential TFs, Upc2, and the genes whose expression is controlled by that same TF. This visualisation was obtained by following the corresponding 'View' link, in the results table. Highlighted in the red circle are the seven Upc2 TGs predicted to be essential in the same environmental conditions.
The relationships between metabolites and reactions can be described by a stoichiometric matrix, and those between reactions and genes by a binary matrix. A well-constructed model enables the simulation of an organism's behaviour, i.e. how much of each metabolite is produced or consumed in a given medium or environmental condition, entirely in silico, using constraint-based modelling (30). Despite many efforts to integrate omics data, including transcriptomics, proteomics, metabolomics and fluxomics data, into available metabolic models, it is still not possible to integrate full regulatory data in any of the currently available metabolic models.
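To make the two matrices concrete, the Python sketch below encodes a three-reaction toy network, checks whether a flux vector satisfies the steady-state mass balance that underlies constraint-based modelling, and looks up the genes associated with a reaction. The network, the gene names and the flux values are invented for illustration and are not taken from any published GSMM.

```python
# Toy network (hypothetical):
#   R1:   -> A   (uptake)
#   R2: A -> B   (conversion)
#   R3: B ->     (secretion)
metabolites = ["A", "B"]
reactions = ["R1", "R2", "R3"]
genes = ["g1", "g2", "g3"]

# Stoichiometric matrix S: rows are metabolites, columns are reactions.
S = [
    [1, -1, 0],   # A is produced by R1 and consumed by R2
    [0, 1, -1],   # B is produced by R2 and consumed by R3
]

# Binary reaction-gene matrix G: rows are reactions, columns are genes.
G = [
    [1, 0, 0],    # R1 is associated with g1
    [0, 1, 1],    # R2 is associated with g2 and g3
    [0, 0, 0],    # R3 has no associated gene
]


def net_production(S, v):
    """Net production rate of each metabolite for the flux vector v (S * v)."""
    return [sum(coeff * flux for coeff, flux in zip(row, v)) for row in S]


def is_steady_state(S, v, tol=1e-9):
    """True if no metabolite accumulates or is depleted (S * v = 0)."""
    return all(abs(rate) <= tol for rate in net_production(S, v))


def genes_for_reaction(name):
    """Genes flagged in the binary matrix for the given reaction."""
    row = G[reactions.index(name)]
    return [gene for gene, flag in zip(genes, row) if flag]
```

A flux vector such as `[2.0, 2.0, 2.0]` balances every metabolite, whereas `[2.0, 1.0, 1.0]` leaves metabolite A accumulating. Constraint-based methods search for optimal flux vectors among those that satisfy this balance and the flux bounds imposed by the medium.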
In this YEASTRACT+ release, automated tools to exploit current yeast GSMMs are provided for S. cerevisiae and C. albicans, relying on COBRApy (31). The expansion of their use to all other yeasts in the database is envisaged. Two main goals can now be achieved using the proposed new queries: (i) the prediction of the genes whose expression manipulation may lead to increased production of a chosen metabolite, from a metabolic engineering perspective and (ii) the prediction of the genes that may be used as drug targets, based on their essentiality in chosen conditions. Thanks to the integration of regulatory information, it is also possible to predict the TFs whose expression is worth manipulating to optimise metabolite production or the TFs that may be considered promising drug targets. Details on how these new tools can be used in these contexts follow.

Prediction of metabolic and TF encoding genes envisaging cell factory optimisation
Using the new YEASTRACT query 'Predict [metabolite] optimisation by manipulating gene expression', it is possible to search for the genes whose deletion, down-regulation or up-regulation is predicted to improve the production of a metabolite of interest. The S. cerevisiae GSMM currently used in YEASTRACT+ is Yeast8 (32). The query includes the selection of a specific growth medium. Currently, two growth media are available, Synthetic Minimal medium and Glucose-rich Synthetic Complete medium, whose compositions are shown by clicking the link 'model/medium'. To optimise the production of a chosen compound, genes or TFs predicted to be of interest can be ranked according to one of three criteria: 'Reaction flux', 'Biomass-Product Coupled Yield (BPCY)' or 'Product Yield with Minimum Biomass (PYMB)'.
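The text does not define the three rank criteria. As a rough illustration only, the Python sketch below assumes definitions commonly used in the strain design literature: BPCY as product flux multiplied by biomass flux and divided by substrate uptake, and PYMB as product yield on substrate, set to zero when biomass falls below a minimum fraction of the wild-type value. These formulas, the `min_biomass_fraction` parameter, the candidate names and all numbers are assumptions, and may differ from what YEASTRACT+ actually computes.

```python
from functools import partial


def reaction_flux(product, biomass, uptake):
    """Rank by the product exchange flux alone."""
    return product


def bpcy(product, biomass, uptake):
    """Assumed BPCY: product flux * biomass flux / substrate uptake flux."""
    return product * biomass / uptake if uptake > 0 else 0.0


def pymb(product, biomass, uptake, wt_biomass, min_biomass_fraction=0.1):
    """Assumed PYMB: product yield on substrate, zero if growth is too low."""
    if uptake <= 0 or biomass < min_biomass_fraction * wt_biomass:
        return 0.0
    return product / uptake


def rank(candidates, score):
    """Candidate names sorted from best to worst according to score."""
    return sorted(candidates, key=lambda name: score(**candidates[name]), reverse=True)


# Invented simulation results for two hypothetical manipulations.
candidates = {
    "gene_X OE": {"product": 18.0, "biomass": 0.02, "uptake": 10.0},
    "gene_Y KO": {"product": 16.0, "biomass": 0.30, "uptake": 10.0},
}
wt_biomass = 0.40
pymb_score = partial(pymb, wt_biomass=wt_biomass)
```

With these invented numbers, ranking by reaction flux favours the manipulation with the highest product flux, while the two yield-based scores favour the manipulation that preserves growth. This shows why the choice of criterion matters when production and growth trade off.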
For example, to identify genes whose expression manipulation may increase ethanol production in S. cerevisiae, 'Glucose-rich SC medium' is selected, as it mimics a situation of high glucose availability and low oxygen availability, which is typical of industrial alcoholic fermentation (Figure 1). The metabolite of interest is defined in the appropriate box, 'ethanol exchange'. Upon clicking the 'Search' button, the results are displayed in a table format, listing the genes whose expression manipulation is predicted to optimise ethanol production, in the pre-selected conditions (Figure 1). Predicted metabolite production is given in terms of metabolite exchange flux. The impact on metabolite production of different changes in gene expression is displayed in the table, from full gene Knock Out (KO) or decreased gene expression (UE, Under-Expression, from 0 to 0.75-fold the wild-type levels), to increased gene expression (OE, Over-Expression, from 1.25- to 1.5-fold the wild-type levels) (33). The final columns highlight the highest level of metabolite production and the manipulation leading to that level, followed by the 'View' link, which gives details on the changes imposed on reaction fluxes by said manipulation. In this case, the over-expression of 67 genes or the deletion/down-regulation of 106 genes is expected to result in a moderate increase in ethanol production. For example, increasing the expression of PDC1, PDC5 or PDC6, encoding three pyruvate decarboxylases, is predicted to increase ethanol production, possibly by increasing the production of acetaldehyde, which may then be converted into ethanol by alcohol dehydrogenases. Another suggested route for increased ethanol production is the deletion of any one of the 18 ATP genes, encoding subunits of the F1F0-ATP synthase, which catalyses the last step of oxidative phosphorylation, a process that requires the consumption of ethanol or ethanol precursors through respiration.
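The way one row of the results table is summarised can be sketched in a few lines of Python: given the wild-type exchange flux and the flux predicted for each manipulation, pick the best feasible manipulation and flag the cells that fall below the wild-type value or are infeasible. The function name, the manipulation labels and the flux values are invented for illustration, and the set of fold levels shown is an assumption.

```python
def summarise_row(wt_flux, predicted):
    """Summarise one gene's row of predicted exchange fluxes.

    predicted maps a manipulation label to a flux, or to None when the
    simulation is infeasible. Returns (best label, best flux, flagged labels),
    where flagged labels are those below the wild-type flux or infeasible.
    """
    feasible = {label: flux for label, flux in predicted.items() if flux is not None}
    flagged = [label for label, flux in predicted.items()
               if flux is None or flux < wt_flux]
    if not feasible:
        return None, None, flagged
    best = max(feasible, key=feasible.get)
    return best, feasible[best], flagged


# Invented values for a hypothetical gene; labels are illustrative only.
wt_flux = 15.0
row = {"KO": None, "UE=0.5": 14.2, "UE=0.75": 14.8, "OE=1.25": 15.6, "OE=1.5": 16.1}
best_label, best_flux, flagged = summarise_row(wt_flux, row)
```

Here the flagged labels correspond to the shaded cells of the table, and the best label and flux to its last two data columns.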
The most novel outcome of this new set of tools is obtained with the query 'Predict [metabolite] optimisation by manipulating Transcription Factor (TF) expression' (Figure 2). This tool enables the identification of the TFs whose deletion, down-regulation or up-regulation is predicted to improve the production of a metabolite of interest. Here again, the user may choose 'Glucose-rich SC medium', and 'ethanol exchange' as the reaction to be optimised. It is possible to filter the regulations to be considered, selecting documented regulations with expression evidence, positive and/or negative, or additionally requiring DNA binding evidence. Once the 'Search' button is clicked, the results are displayed in a table listing the TFs whose expression manipulation is predicted to enable the optimisation of ethanol production, in the pre-selected conditions (Figure 2). Predicted metabolite production is given in terms of metabolite exchange flux. The impact on metabolite production of TF KO or OE is predicted, considering a wide range of possible effects of the TF on the expression of its activated and repressed target genes (UE, Under-Expression, of TF-activated target genes from 0 to 0.5-fold the wild-type levels; OE, Over-Expression, of TF-repressed target genes from 1.25- to 1.5-fold the wild-type levels). The final columns highlight the highest level of metabolite production obtained by the expression manipulation of each TF, followed by the possibility to 'View' details on the changes imposed on reaction fluxes by said manipulations. In this case, the over-expression of 18 TFs or the deletion of 23 TFs is expected to result in a moderate increase in ethanol production. For example, increasing the expression of the PDC2 TF-encoding gene is predicted to increase ethanol production. Interestingly, Pdc2 controls the expression of PDC1 and PDC5, whose own over-expression is predicted to increase ethanol production, as discussed above. On the other hand, the KO of MIG1, GCR2 or HAP2, encoding TFs involved in the control of glucose repression, glycolysis and respiration, respectively, is predicted to lead to increased ethanol production, likely through its effect on the expression of a combination of central carbon metabolism genes. To the best of our knowledge, the impact of the expression level of these TFs on ethanol production has never been evaluated.
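The translation of a TF manipulation into manipulations of its target genes can be sketched as follows. In this Python example, a TF knock-out is mapped to under-expression of activated targets and over-expression of repressed targets, and TF over-expression to the opposite, with an optional filter on DNA binding evidence. The record layout and all names are hypothetical. The fold ranges come from the text for the knock-out case and are assumed to apply symmetrically to over-expression.

```python
from typing import NamedTuple


class Regulation(NamedTuple):
    """One documented TF-target regulation (hypothetical record layout)."""
    tf: str
    target: str
    sign: str          # "positive" (TF activates) or "negative" (TF represses)
    dna_binding: bool  # True if DNA binding evidence is also available


UE_RANGE = (0.0, 0.5)    # under-expression, fold of wild-type level
OE_RANGE = (1.25, 1.5)   # over-expression, fold of wild-type level


def target_effects(tf, manipulation, regulations, require_binding=False):
    """Map a TF manipulation ("KO" or "OE") to effects on its target genes.

    Returns {target: (direction, fold range)}. If a target carries several
    records, the last one wins (a simplification of this sketch).
    """
    if manipulation not in ("KO", "OE"):
        raise ValueError("manipulation must be 'KO' or 'OE'")
    effects = {}
    for reg in regulations:
        if reg.tf != tf or (require_binding and not reg.dna_binding):
            continue
        activated = reg.sign == "positive"
        # Losing an activator, or gaining a repressor, lowers target expression.
        lowered = activated if manipulation == "KO" else not activated
        effects[reg.target] = ("UE", UE_RANGE) if lowered else ("OE", OE_RANGE)
    return effects


# Invented regulations for illustration only.
regulations = [
    Regulation("TF_1", "GENE_A", "positive", True),
    Regulation("TF_1", "GENE_B", "negative", False),
    Regulation("TF_2", "GENE_C", "positive", True),
]
```

The resulting target-level manipulations could then be simulated on the metabolic model in the same way as single-gene manipulations, scanning the indicated fold ranges.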
If users wish to use a growth medium or a yeast model that is not currently available at YEASTRACT, they are invited to contact our support team, which will evaluate its importance and make it available to the wider community.

Prediction of metabolic and TF encoding genes as promising drug targets
The new 'Essentiality' prediction query is offered to YEASTRACT+ users, particularly with the aim of identifying new drug targets. The use of this new tool can be exemplified in the case of the human pathogen C. albicans. The C. albicans GSMM currently used by YEASTRACT+ is iRV781 (34). The query includes the selection of a specific growth medium. Currently, two growth media are available, Synthetic Minimal Medium and RPMI 1640 medium, whose compositions are shown by clicking the link 'model/medium'. The essentiality search can be performed by looking for essential genes, essential reactions (which may be coupled to several metabolic genes) or essential TFs.
For example, if the user wishes to identify C. albicans metabolic genes that are essential under conditions found in the human host environment, 'RPMI 1640 medium' may be selected, as it mimics human serum (Figure 3). Upon selecting 'Genes' and clicking the 'Search' button, the results are displayed in a table format, listing the genes whose deletion leads to biomass production flux below 1% of that of the wild-type strain, in the selected growth medium (Figure 3). Consistent with the proposed applicability of this approach, the list of identified essential genes includes ergosterol biosynthesis genes, among them ERG11, which encodes the target of the currently used family of azole antifungal drugs, as reviewed in (35). Notably, the GSC1 gene, encoding the target of echinocandin antifungal drugs, is not identified as an essential gene in this query. The reason is that C. albicans has two paralogs of GSC1, GSL1 and GSL2, which are predicted to maintain cell viability when GSC1 is absent. For such cases, searching for essential reactions, instead of essential genes, is more promising. When using the 'Essentiality' prediction query with the 'Reactions' option, the results, displayed in a table format, provide the list of reactions whose blockage (reaction flow = 0) is predicted to lead to biomass production flux below 1% of that of the wild-type strain, in the selected growth medium. This list of essential reactions includes the reaction 'UDP-glucose <=> UDP + 1,3-beta-D-Glucan'. If the user follows the link associated with the reaction name, the underlying genes are indicated, which, in this case, are precisely the genes encoding the echinocandin targets, GSC1, GSL1 and GSL2.
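The difference between gene and reaction essentiality in the presence of isoenzymes can be illustrated with a small Python sketch. It encodes the 1% biomass threshold and the handling of 'KO' versus 'UE=0.0' described for the gene essentiality query (Figure 3): a knock-out of one isoenzyme gene leaves the reaction flux untouched, whereas 'UE=0.0' reduces it in inverse proportion to the number of isoenzymes. The function names are hypothetical, enzyme complexes are ignored, and the code is not the YEASTRACT+ implementation.

```python
def is_essential(biomass_flux, wt_biomass_flux, threshold=0.01):
    """True if biomass production falls below 1% of the wild-type flux."""
    return biomass_flux < threshold * wt_biomass_flux


def remaining_flux_fraction(n_isoenzymes, manipulation):
    """Fraction of the reaction flux kept after manipulating one isoenzyme gene.

    "KO":     no impact if other isoenzymes exist, full block otherwise.
    "UE=0.0": flux reduced by 1/n, where n is the number of isoenzymes.
    """
    if n_isoenzymes < 1:
        raise ValueError("a reaction needs at least one enzyme")
    if manipulation == "KO":
        return 1.0 if n_isoenzymes > 1 else 0.0
    if manipulation == "UE=0.0":
        return 1.0 - 1.0 / n_isoenzymes
    raise ValueError("manipulation must be 'KO' or 'UE=0.0'")


# Genes linked to the 1,3-beta-D-glucan synthesis reaction in the text.
glucan_synthase_genes = ["GSC1", "GSL1", "GSL2"]
ko_fraction = remaining_flux_fraction(len(glucan_synthase_genes), "KO")
ue_fraction = remaining_flux_fraction(len(glucan_synthase_genes), "UE=0.0")
```

With three isoenzymes, the knock-out of GSC1 alone keeps the full reaction flux, so the gene is not flagged as essential, while blocking the reaction itself removes the flux entirely. This is why the reaction-level query recovers the echinocandin target.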
Again, the most novel outcome of this new set of tools is obtained with the 'Essentiality' prediction query, option 'TFs', as it enables the identification of the TFs whose deletion is predicted to lead to biomass production flux below 1% of that of the wild-type strain, in the selected growth medium. Again, the user may choose 'RPMI 1640 medium' as the condition of choice. Once the 'Search' button is clicked, the table of results lists the TFs predicted to be essential in the pre-selected conditions (Figure 4). Three TFs are predicted to be essential in 'RPMI 1640 medium'. Although none of them is encoded by a truly essential gene (one whose deletion generates an inviable cell), the YEASTRACT+ modelling tools predict that in this medium, mimicking human serum, they are crucial for biomass production. Although the exact effect of TF deletion on the metabolic reaction fluxes is difficult to predict, it is interesting to observe, in Figure 4, that, for example, the Upc2 TF does indeed control the expression of 18 metabolic genes, seven of them being involved in ergosterol biosynthesis and predicted to be essential in the same environmental conditions. This result is consistent with UPC2 essentiality in these conditions.

FUTURE DIRECTIONS
The YEASTRACT+ team is committed to continuously updating the portal and to offering reliable and complete information on yeast transcription regulation to the international research community. As the scope of the database is expanded to cover a wider range of yeast species of biotechnological or medical interest, made easier by the creation of CommunityYeastract, our ability to serve our users is expected to improve. The expansion of the new network modelling tools to all yeast species for which a GSMM is available will be pursued, as well as an increase in the number of options offered in this context, particularly the possibility of predicting synthetic lethality as a means to identify possible targets for combination therapy.

DATA AVAILABILITY
All data underlying this article are available through the YEASTRACT+ portal without restrictions (http://yeastract-plus.org/). Flat files for computational analyses are shared on request.